Microbial Ecology - The Theory

In ecology, scientists collect quantitative data to determine the resemblance between either the objects under study (objects) or the variables describing them (species or other descriptors). The analysis of the association between objects and descriptors is the first step in the analysis.

There are three main types of association:

After creating these measures of association, common tasks are to cluster or to ordinate.

To cluster: Sets of objects (or descriptors) can be partitioned into two or more subsets (clusters) using pre-established rules of agglomeration.

To ordinate: Objects (or descriptors) are represented in a space containing fewer dimensions than the original dataset such that the positions of the objects (or descriptors) with respect to one another may also be used to cluster them.

Microbial Ecology - The Practice

As scientists interested in the microbiome, we are now all microbial ecologists. We typically talk about objects as subjects from whom samples have been taken (or even sites within subjects, e.g., oral, vaginal, rectal) while the descriptors are the list of abundances of each species (or other taxonomic classification) found.

So we become interested in how similar subjects are to each other, say, within group i, and how distant they are from, say, group j. Then we might want to know if that distance is associated with some other difference between the groups, for instance, are groups that are closer together in site pH levels also more similar (i.e., less distant).

Other questions of interest: how do sites (or species) cluster together? Also, can you use a multivariate dimensionality reduction technique to describe differences while preserving distances?

Recall if it is appropriate to determine distances between two points with Euclidean distance, the ensuing multidimensional reduction technique is principal component analysis (PCA). If the data are counts then the distance should be calculated in terms of the \(\chi^2\) distance, and then the reduction technique is canonical component analysis (CCA). If the distance metrics take on some other value (e.g., dissimilarity), then principal coordinate analysis (PCoA), which is really PCA of the squared distance matrix.

These are the types of questions that we can use phyloseq to address with our OTUtables and our sample data.

Beyond the OTU Table

phyloseq is an R package for efficient interactive and reproducible analyses of OTU-clustered high-throughput phylogenetic sequencing data. We can take the sequence variant table produced by dada2 and use it, along with taxonomic classification and data about the samples, in these analyses. Phyloseq imports, stores, analyzes, and displays these data.

In our last lesson we showed how to hand off data to phyloseq from dada2, using the phyloseq command to create a phyloseq object from an OTU table, a taxonomic table, and a sample dataset.

In this lesson we will some of the other functionality of phyloseq:

Load Packages and Example Data

To get started we will need to load both phyloseq and ggplot2.

# Load packages

library(phyloseq)
packageVersion("phyloseq")
## [1] '1.19.1'
library(ggplot2) 
packageVersion("ggplot2")
## [1] '2.2.1'
library(RColorBrewer)
packageVersion("RColorBrewer")
## [1] '1.1.2'

Load example data using the data() command.

# Load example data

data(GlobalPatterns)
data(esophagus)
data(enterotype)
data(soilrep)

You can try these examples using the example() command.

# Use the example() command on the enterotype database

example(enterotype, ask=FALSE)
## 
## entrty> data(enterotype)
## 
## entrty> ig <- make_network(enterotype, "samples", max.dist=0.3)
## 
## entrty> plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.3, label=NULL)
## Warning: attributes are not identical across measure variables; they will
## be dropped

You can also import data from other packages such as MG-RAST or QIIME. The newer versions of QIIME produce files formatted according to the biom standard (see http://biom-format.org).

The phyloseq package has small examples of biom files with different levels and organization of data. The following chunk illustrates how to import each of the 4 main types of biom files. Note that in practice you won’t know what type your file is, so best to include all 4. In addition, the import_biom funciton allows you to import an associated phylogenetic tree file and reference sequence file (e.g., fasta file).

To do this you must first define the file paths. For this example, this should be within the phyloseq package. Thus we use the system.file command to get the paths.

# Load the files
rich_dense_biom  <- system.file("extdata", "rich_dense_otu_table.biom",  package="phyloseq")
rich_sparse_biom <- system.file("extdata", "rich_sparse_otu_table.biom", package="phyloseq")
min_dense_biom   <- system.file("extdata", "min_dense_otu_table.biom",   package="phyloseq")
min_sparse_biom  <- system.file("extdata", "min_sparse_otu_table.biom",  package="phyloseq")
treefilename <- system.file("extdata", "biom-tree.phy",  package="phyloseq")
refseqfilename <- system.file("extdata", "biom-refseq.fasta",  package="phyloseq")

Next we use these paths as argments to the import_biom function. The tree and reference sequence files are both suitable for any of the biom file types, so we only need one path for each.

# Import via the paths

import_biom(rich_dense_biom, treefilename, refseqfilename, parseFunction = parse_taxonomy_greengenes)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 5 taxa and 6 samples ]
## sample_data() Sample Data:       [ 6 samples by 4 sample variables ]
## tax_table()   Taxonomy Table:    [ 5 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 5 tips and 4 internal nodes ]
## refseq()      DNAStringSet:      [ 5 reference sequences ]
import_biom(rich_sparse_biom, treefilename, refseqfilename, parseFunction = parse_taxonomy_greengenes)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 5 taxa and 6 samples ]
## sample_data() Sample Data:       [ 6 samples by 4 sample variables ]
## tax_table()   Taxonomy Table:    [ 5 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 5 tips and 4 internal nodes ]
## refseq()      DNAStringSet:      [ 5 reference sequences ]
import_biom(min_dense_biom, treefilename, refseqfilename, parseFunction = parse_taxonomy_greengenes)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 5 taxa and 6 samples ]
## phy_tree()    Phylogenetic Tree: [ 5 tips and 4 internal nodes ]
## refseq()      DNAStringSet:      [ 5 reference sequences ]
import_biom(min_sparse_biom, treefilename, refseqfilename, parseFunction = parse_taxonomy_greengenes)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 5 taxa and 6 samples ]
## phy_tree()    Phylogenetic Tree: [ 5 tips and 4 internal nodes ]
## refseq()      DNAStringSet:      [ 5 reference sequences ]

In practice you will store the result of your import as some variable name, like myData, then use this object in downstream analyses. For examples,

# Example of storing results of import and manipulating it

myData <- import_biom(rich_dense_biom, treefilename, refseqfilename, parseFunction = parse_taxonomy_greengenes)
myData
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 5 taxa and 6 samples ]
## sample_data() Sample Data:       [ 6 samples by 4 sample variables ]
## tax_table()   Taxonomy Table:    [ 5 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 5 tips and 4 internal nodes ]
## refseq()      DNAStringSet:      [ 5 reference sequences ]
plot_tree(myData, color = "Genus", shape = "BODY_SITE", size = "abundance")

plot_richness(myData, x="BODY_SITE", color="Description")
## Warning: Removed 33 rows containing missing values (geom_errorbar).

plot_bar(myData, fill="Genus")

refseq(myData)
##   A DNAStringSet instance of length 5
##     width seq                                          names               
## [1]   334 AACGTAGGTCACAAGCGTTGT...TTCCGTGCCGGAGTTAACAC GG_OTU_1
## [2]   465 TACGTAGGGAGCAAGCGTTAT...CCTTACCAGGGCTTGACATA GG_OTU_2
## [3]   249 TACGTAGGGGGCAAGCGTTAT...GGCTCGAAAGCGTGGGGAGC GG_OTU_3
## [4]   453 TACGTATGGTGCAAGCGTTAT...AAGCAACGCGAAGAACCTTA GG_OTU_4
## [5]   178 AACGTAGGGTGCAAGCGTTGT...GGAATGCGTAGATATCGGGA GG_OTU_5

The good people at phyloseq have also downloaded data from the Human Microbiome Project and made it available as an R dataset. So let’s look at it

# Get the HMP data
# You will need to change your path

load("~/Documents/NRSG_741/Lesson12/HMPv35.RData")

HMPv35
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 45336 taxa and 4743 samples ]
## sample_data() Sample Data:       [ 4743 samples by 9 sample variables ]
## tax_table()   Taxonomy Table:    [ 45336 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 45336 tips and 45099 internal nodes ]
## refseq()      DNAStringSet:      [ 45336 reference sequences ]

Merge Data

phyloseq supports two types of merging.

  1. You can merge OTUs or samples in a phyloseq object, based on a taxonomic or sample variable using the merge_samples() or merge_taxa() commands.
  2. YOu can merge two or more data objects from the same experiment so that their data become part of the same phyloseq object using the merge_phyloseq() commend.

Merging Samples

Merging samples with merge_samples() can be a useful means of reducing noise or excess features, say by removing the individual effects between replicates or between samples for a particular explanatory variable. With merge_samples() the abundance values for the merged samples are summed.

As an example, let’s first remove unobserved OTUs (those that sum 0 across all samples) and add an explanatory variable with which to organize later in plots, using the GlobalPatterns dataset.

# Remove empty samples

GP <- GlobalPatterns
GP <- prune_taxa(taxa_sums(GlobalPatterns) > 0, GlobalPatterns)
humantypes <- c("Feces", "Mock", "Skin", "Tongue")
sample_data(GP)$human <- get_variable(GP, "SampleType") %in% humantypes

Now on to merging samples:

# Merge Samples

mergedGP <- merge_samples(GP, "SampleType")
SD <- merge_samples(sample_data(GP), "SampleType")
print(SD[, "SampleType"])
##                    SampleType
## Feces                       1
## Freshwater                  2
## Freshwater (creek)          3
## Mock                        4
## Ocean                       5
## Sediment (estuary)          6
## Skin                        7
## Soil                        8
## Tongue                      9
print(mergedGP)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 18988 taxa and 9 samples ]
## sample_data() Sample Data:       [ 9 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 18988 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 18988 tips and 18987 internal nodes ]
sample_names(GP)
##  [1] "CL3"      "CC1"      "SV1"      "M31Fcsw"  "M11Fcsw"  "M31Plmr" 
##  [7] "M11Plmr"  "F21Plmr"  "M31Tong"  "M11Tong"  "LMEpi24M" "SLEpi20M"
## [13] "AQC1cm"   "AQC4cm"   "AQC7cm"   "NP2"      "NP3"      "NP5"     
## [19] "TRRsed1"  "TRRsed2"  "TRRsed3"  "TS28"     "TS29"     "Even1"   
## [25] "Even2"    "Even3"
sample_names(mergedGP)
## [1] "Feces"              "Freshwater"         "Freshwater (creek)"
## [4] "Mock"               "Ocean"              "Sediment (estuary)"
## [7] "Skin"               "Soil"               "Tongue"
identical(SD, sample_data(mergedGP))
## [1] TRUE

The OTU abundances of merged samples are summed. Let’s investigate this ourselves looking at just the top 10 most abundant OTUs.

# Look at top 10 most abundant OTUs

OTUnames10 <- names(sort(taxa_sums(GP), TRUE)[1:10])
GP10 <- prune_taxa(OTUnames10, GP)
mGP10 <- prune_taxa(OTUnames10,mergedGP)
ocean_samples <- sample_names(subset(sample_data(GP), SampleType == "Ocean"))
print(ocean_samples)
## [1] "NP2" "NP3" "NP5"
otu_table(GP10)[, ocean_samples]
## OTU Table:          [10 taxa and 3 samples]
##                      taxa are rows
##         NP2   NP3   NP5
## 329744   91   126   120
## 317182 3148 12370 63084
## 549656 5045 10713  1784
## 279599  113   114   126
## 360229   16    83   786
## 94166    49   128   709
## 550960   11    86    65
## 158660   13    39    28
## 331820   24   101   105
## 189047    4    33    29
rowSums(otu_table(GP10)[, ocean_samples])
## 329744 317182 549656 279599 360229  94166 550960 158660 331820 189047 
##    337  78602  17542    353    885    886    162     80    230     66
otu_table(mGP10)["Ocean", ]
## OTU Table:          [10 taxa and 1 samples]
##                      taxa are columns
##       329744 317182 549656 279599 360229 94166 550960 158660 331820 189047
## Ocean    337  78602  17542    353    885   886    162     80    230     66

Now let’s look at the merge graphically between two richnes estimate summary plots.

# Richness estimate plots

plot_richness(GP, "human", "SampleType", title="Unmerged")
## Warning: Removed 130 rows containing missing values (geom_errorbar).

The merge can do some weird things to samples variables, so let’s re-add these variables to the sample_data before we plot.

# Re-add sample data and plot richness plots again.

sample_data(mergedGP)$SampleType <- sample_names(mergedGP)
sample_data(mergedGP)$human <- sample_names(mergedGP) %in% humantypes
plot_richness(mergedGP, "human", "SampleType", title="merged")
## Warning: Removed 45 rows containing missing values (geom_errorbar).

When we combine the abundances of non-replicate samples from the same environment, the estimates of absolute richness increase for each environment. But, more to the point, the new plot is easier to read and interpret, one of the reasons one might use the merge_samples function.

Merging Taxa

One of the issues with microbial census data is that a fine-scaled definition for OTU may blur a pattern that might otherwise be evident if we considered a higher taxonomic rank. The best way to deal with this is to agglomerate phylogenetic tree tips or taxa by using the agglomeration functions tip_glom or tax_glom. Let’s use the object we imported earlier from the biom format files, myData. First the original tree:

# Plot phylogenetic tree of GlobalPatterns object
plot_tree(GP, color="SampleType", sizebase=2, label.tips="taxa_name")

What a mess. Let’s try using merge_taxa to merge the first 100 OTUs into one new OTU. The option 2 in the call means to combine the counts of these 100 OTUs into the index for the new OTU.

# Combine the tree tips and plot again

x1 <- merge_taxa(GP, taxa_names(GP)[1:100], 2)
plot_tree(x1, color="SampleType", sizebase=2, label.tips="taxa_name")

You can barely see the difference, but it is there.

Merging Objects

You can also merge phyloseq objects. We will demonstrate with a somewhat trivial example by extracting the components of the GlobalPatterns object and then building them back to the original form using the command merge_phyloseq.

# Split apart GlobalPatterns

data(GlobalPatterns)
tree <- phy_tree(GlobalPatterns)
tax <- tax_table(GlobalPatterns)
otu <- otu_table(GlobalPatterns)
sam <- sample_data(GlobalPatterns)

# Create new phyloseq object

otutax <- phyloseq(otu, tax)
otutax
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 19216 taxa and 26 samples ]
## tax_table()   Taxonomy Table:    [ 19216 taxa by 7 taxonomic ranks ]

You can se that our new otutax object has just the OTU table and the taxonomy table. Let’s use merge_phyloseq to build up the original GlobalPatterns object, and compare to make sure they are identical.

# Build back up

GP2 = merge_phyloseq(otutax, sam, tree)

# Test for identity

identical(GP2, GlobalPatterns)
## [1] TRUE

Accessing and (Pre)Processing Data

Accessors

We have already seen a bit above about how to access some of the data within a phyloseq object. We will comprehensively cover it now.

Let’s use the GlobalPatterns dataset.

# Load data object and see what is there

data("GlobalPatterns")
GlobalPatterns
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 19216 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 19216 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
ntaxa(GlobalPatterns)
## [1] 19216
nsamples(GlobalPatterns)
## [1] 26
sample_names(GlobalPatterns)[1:5]
## [1] "CL3"     "CC1"     "SV1"     "M31Fcsw" "M11Fcsw"
rank_names(GlobalPatterns)
## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"
sample_variables(GlobalPatterns)
## [1] "X.SampleID"               "Primer"                  
## [3] "Final_Barcode"            "Barcode_truncated_plus_T"
## [5] "Barcode_full_length"      "SampleType"              
## [7] "Description"
otu_table(GlobalPatterns)[1:5, 1:5]
## OTU Table:          [5 taxa and 5 samples]
##                      taxa are rows
##        CL3 CC1 SV1 M31Fcsw M11Fcsw
## 549322   0   0   0       0       0
## 522457   0   0   0       0       0
## 951      0   0   0       0       0
## 244423   0   0   0       0       0
## 586076   0   0   0       0       0
tax_table(GlobalPatterns)[1:5, 1:4]
## Taxonomy Table:     [5 taxa by 4 taxonomic ranks]:
##        Kingdom   Phylum          Class          Order         
## 549322 "Archaea" "Crenarchaeota" "Thermoprotei" NA            
## 522457 "Archaea" "Crenarchaeota" "Thermoprotei" NA            
## 951    "Archaea" "Crenarchaeota" "Thermoprotei" "Sulfolobales"
## 244423 "Archaea" "Crenarchaeota" "Sd-NA"        NA            
## 586076 "Archaea" "Crenarchaeota" "Sd-NA"        NA
phy_tree(GlobalPatterns)
## 
## Phylogenetic tree with 19216 tips and 19215 internal nodes.
## 
## Tip labels:
##  549322, 522457, 951, 244423, 586076, 246140, ...
## Node labels:
##  , 0.858.4, 1.000.154, 0.764.3, 0.995.2, 1.000.2, ...
## 
## Rooted; includes branch lengths.
taxa_names(GlobalPatterns)[1:10]
##  [1] "549322" "522457" "951"    "244423" "586076" "246140" "143239"
##  [8] "244960" "255340" "144887"
myTaxa <- names(sort(taxa_sums(GlobalPatterns), decreasing = TRUE)[1:10])
ex1 <- prune_taxa(myTaxa, GlobalPatterns)
plot(phy_tree(ex1), show.node.label = TRUE)

plot_tree(ex1, color = "SampleType", label.tips = "Phylum", ladderize = "left", justify = "left", size = "Abundance")

Preprocessing

phyloseq also includes functions for filtering, subsetting, and merging abundance data. Let’s filter!

Filtering

Filtering is designed in a modular fashion. The functions prune_taxa and prune_samples will directly remove unwanted indices. The functions filterfun_sample and genefilter_sample will allow you to build arbitrarily complex sample-wise filtering criteria. The function filter_taxa allows for taxa-wise filtering.

In the following example we will use the GlobalPatterns object again, first transforming it to relative abundance in a new GPr object, then filtering that object to remove all OTUs that have a mean abundance < \(10^-5\), creating another new object.

# Create the object with the relative abundance data

GPr <- transform_sample_counts(GlobalPatterns, function(x) x / sum(x))
GPfr <- filter_taxa(GPr, function(x) mean(x) > 1e-5, TRUE)

GlobalPatterns
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 19216 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 19216 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
GPr
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 19216 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 19216 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
GPfr
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 4624 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 4624 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 4624 tips and 4623 internal nodes ]

Subsetting

We see that the final object, GPfr is highly subsetted containing just a little less than a quarter of the original OTUs (4624 / 19216).

The functions prune_sample and prune_taxa are to be used in cases where the complete subset of OTUS or samples is directly available. Alternatively the subsets_taxa and subset_samples are best for subsetting based on other data contained in the taxonomy and sample datasets respectively.

As an example we will obtain the subset of the GlobalPatterns dataset that is part of the Chlamydiae phylum, then remove samples with less than 20 reads.

# Subset for Chlamydiae

GP.ch1 <- subset_taxa(GlobalPatterns, Phylum == "Chlamydiae")

# Eliminate samples whose total reads are less than 20

GP.ch1 <- prune_samples(sample_sums(GP.ch1) >= 20, GP.ch1)

GP.ch1
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 21 taxa and 9 samples ]
## sample_data() Sample Data:       [ 9 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 21 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 21 tips and 20 internal nodes ]
plot_tree(GP.ch1)

Merging

Merging methods include merge_taxa and merge_samples for merging specific OTUs or samples, respectively.

The following chunk shows the merge of the first 5 OTUs in teh Chlamydiae-only dataset.

# Merge first 5 OTUSs in Chlamydiae-only dataset

GP.ch1.merged <- merge_taxa(GP.ch1, taxa_names(GP.ch1)[1:5])

Let’s do one final set of preprocessing steps

# Final preprocess for GlobalPatterns: filter out taxa not see more than 3 times in at least
# 20% of samples

GP <- filter_taxa(GlobalPatterns, function(x) sum(x > 3) > (0.2*length(x)), TRUE)

# Define human v non-human and add to sample data:

sample_data(GP)$human <- factor(get_variable(GP, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue"))

# Standardize abundances to median sequencing depth

total <- median(sample_sums(GP))
standf <- function(x, t=total) round(t * (x / sum(x)))
gps <- transform_sample_counts(GP, standf)

# Filter out taxa with CV > 3.0

gpsf <- filter_taxa(gps, function(x) sd(x) / mean(x) > 3.0, TRUE)

# Subset to Bacteroidetes

gpsfb <- subset_taxa(gpsf, Phylum == "Bacteroidetes")

Graphical Display

Let’s graph this slice of data

# Graph the slice of data

# First a simple bar graph

title = "plot_bar; Bacteroidetes-only"
plot_bar(gpsfb, "SampleType", "Abundance", title = title)

# Now colored by family

plot_bar(gpsfb, "SampleType", "Abundance", "Family", title = title)

# Now faceted with type of samples

plot_bar(gpsfb, "Family", "Abundance", "Family", title = title, facet_grid = "SampleType~.")

You can also define your own graphics “from scratch” using psmelt Here are two examples:

# Example 1
mdf <- psmelt(gpsfb)
# Simple bar plot

ggplot(mdf, aes(x=SampleType,
                y=Abundance)) +
  geom_bar(stat="identity", position = "stack", color="black")

# Simple heat map

ggplot(mdf, aes(x=SampleType,
                y=OTU,
                fill=Abundance)) +
  geom_raster()

Distance Functions

The distance function takes a phyloseq object and a method and returns a dist-class distance object. Currently phyloseq supports 44 different method options. For the complete list, type distanceMethodList. Use the documentation on distance for further details.

The usage is as follows:

distance(phyloseq-object, method=“methodname”, type=“type”)

Because this function organizes distance calculations into one function, it is easy to calculate all distance methods and investigate the results. We are going to do so on the Enterotypes dataset, then perform multidimensional scaling, plot the first two axes, shading and shaping the points in each plot according to sequencing technology and assigned “Enterotype” label.

First some preliminaries, loading plyr, setting the ggplot2 theme, and loading the dataset:

# get plyr
library(plyr)

# set ggplot2 theme
theme_set(theme_bw())

# load enterotype data
data(enterotype)

Some preliminary filtering to remove OTUs that have unassigned sequences (“-1”):

# Remove unassigned OTUs

enterotype <- subset_taxa(enterotype, Genus != "-1")

Let’s see the available distance methods coded in distance:

dist_methods <- unlist(distanceMethodList)
print(dist_methods)
##     UniFrac1     UniFrac2        DPCoA          JSD     vegdist1 
##    "unifrac"   "wunifrac"      "dpcoa"        "jsd"  "manhattan" 
##     vegdist2     vegdist3     vegdist4     vegdist5     vegdist6 
##  "euclidean"   "canberra"       "bray" "kulczynski"    "jaccard" 
##     vegdist7     vegdist8     vegdist9    vegdist10    vegdist11 
##      "gower"   "altGower"   "morisita"       "horn"  "mountford" 
##    vegdist12    vegdist13    vegdist14    vegdist15   betadiver1 
##       "raup"   "binomial"       "chao"        "cao"          "w" 
##   betadiver2   betadiver3   betadiver4   betadiver5   betadiver6 
##         "-1"          "c"         "wb"          "r"          "I" 
##   betadiver7   betadiver8   betadiver9  betadiver10  betadiver11 
##          "e"          "t"         "me"          "j"        "sor" 
##  betadiver12  betadiver13  betadiver14  betadiver15  betadiver16 
##          "m"         "-2"         "co"         "cc"          "g" 
##  betadiver17  betadiver18  betadiver19  betadiver20  betadiver21 
##         "-3"          "l"         "19"         "hk"        "rlb" 
##  betadiver22  betadiver23  betadiver24        dist1        dist2 
##        "sim"         "gl"          "z"    "maximum"     "binary" 
##        dist3   designdist 
##  "minkowski"        "ANY"

Remove distance methods that depend on a phylogenetic tree, because the enterotype object that we started with does not have one. These would be the two unifrac methods and DPCoA.

# Remove methods that need a phylogenetic tree

dist_methods[(1:3)]
##   UniFrac1   UniFrac2      DPCoA 
##  "unifrac" "wunifrac"    "dpcoa"
dist_methods <- dist_methods[-(1:3)]
print(dist_methods)
##          JSD     vegdist1     vegdist2     vegdist3     vegdist4 
##        "jsd"  "manhattan"  "euclidean"   "canberra"       "bray" 
##     vegdist5     vegdist6     vegdist7     vegdist8     vegdist9 
## "kulczynski"    "jaccard"      "gower"   "altGower"   "morisita" 
##    vegdist10    vegdist11    vegdist12    vegdist13    vegdist14 
##       "horn"  "mountford"       "raup"   "binomial"       "chao" 
##    vegdist15   betadiver1   betadiver2   betadiver3   betadiver4 
##        "cao"          "w"         "-1"          "c"         "wb" 
##   betadiver5   betadiver6   betadiver7   betadiver8   betadiver9 
##          "r"          "I"          "e"          "t"         "me" 
##  betadiver10  betadiver11  betadiver12  betadiver13  betadiver14 
##          "j"        "sor"          "m"         "-2"         "co" 
##  betadiver15  betadiver16  betadiver17  betadiver18  betadiver19 
##         "cc"          "g"         "-3"          "l"         "19" 
##  betadiver20  betadiver21  betadiver22  betadiver23  betadiver24 
##         "hk"        "rlb"        "sim"         "gl"          "z" 
##        dist1        dist2        dist3   designdist 
##    "maximum"     "binary"  "minkowski"        "ANY"
# This is the unser-defined method:

dist_methods["designdist"]
## designdist 
##      "ANY"
# Remove it

dist_methods <- dist_methods[-which(dist_methods == "ANY")]
print(dist_methods)
##          JSD     vegdist1     vegdist2     vegdist3     vegdist4 
##        "jsd"  "manhattan"  "euclidean"   "canberra"       "bray" 
##     vegdist5     vegdist6     vegdist7     vegdist8     vegdist9 
## "kulczynski"    "jaccard"      "gower"   "altGower"   "morisita" 
##    vegdist10    vegdist11    vegdist12    vegdist13    vegdist14 
##       "horn"  "mountford"       "raup"   "binomial"       "chao" 
##    vegdist15   betadiver1   betadiver2   betadiver3   betadiver4 
##        "cao"          "w"         "-1"          "c"         "wb" 
##   betadiver5   betadiver6   betadiver7   betadiver8   betadiver9 
##          "r"          "I"          "e"          "t"         "me" 
##  betadiver10  betadiver11  betadiver12  betadiver13  betadiver14 
##          "j"        "sor"          "m"         "-2"         "co" 
##  betadiver15  betadiver16  betadiver17  betadiver18  betadiver19 
##         "cc"          "g"         "-3"          "l"         "19" 
##  betadiver20  betadiver21  betadiver22  betadiver23  betadiver24 
##         "hk"        "rlb"        "sim"         "gl"          "z" 
##        dist1        dist2        dist3 
##    "maximum"     "binary"  "minkowski"

Now we can loop through this list, calculate distances according to each methods, plot the distances, and save each plot to a plist (plist).

# Prepare to receive plots
plist <- vector("list", length(dist_methods))
names(plist) <- dist_methods

for(i in dist_methods ){
  # Calculate distance matrix
  iDist <- distance(enterotype, method=i)
  #Calculate ordination
  iMDS <- ordinate(enterotype, "MDS", distance = iDist)
  ## Make plot
  # Don't carry over previous plot (if errror, p will be blank)
  p <- NULL
  # Create plot, store as temp variable p
  p <- plot_ordination(enterotype, iMDS, color="SeqTech", shape="Enterotype")
  # Add title to each plot
  p <- p + ggtitle(paste("MDS using distance method ", i, sep=""))
  # Save the graphic to file
  plist[[i]] <- p
}
## Warning in vegdist(structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, :
## results may be meaningless with non-integer data in method "morisita"
## Warning in vegdist(structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, :
## results may be meaningless with non-integer data in method "chao"
## Warning in vegdist(structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, :
## results may be meaningless with non-integer data in method "cao"

Combine Results

Shade according to sequencing technology:

# Shade according to sequencing technology:

df <- ldply(plist, function(x) x$data)
names(df)[1] <- "distance"
p <- ggplot(df, aes(Axis.1, Axis.2, color=SeqTech, shape=Enterotype))
p <- p + geom_point(size = 3, alpha = 0.5)
p <- p + facet_wrap(~distance, scales="free")
p <- p + ggtitle("MDS on various distance metrics for Enterotype dataset")
p
## Warning: Removed 387 rows containing missing values (geom_point).

Now, shade instead according to assigned Enterotype

# Shade according to Enterotype:

df <- ldply(plist, function(x) x$data)
names(df)[1] <- "distance"
p <- ggplot(df, aes(Axis.1, Axis.2, color=Enterotype, shape=SeqTech))
p <- p + geom_point(size = 3, alpha = 0.5)
p <- p + facet_wrap(~distance, scales="free")
p <- p + ggtitle("MDS on various distance metrics for Enterotype dataset")
p

Compare Results

Some selected examples among the created plots. First the Jensen-Shannon Divergence.

# Plot ordination with Jensen-Shannon Divergence

print(plist[["jsd"]])

Jaccard distance

# Plot ordination with Jaccard distance

print(plist[["jaccard"]])

Bray-Curtis

# Plot ordination with Bray-Curtis

print(plist[["bray"]])

Gower

# Plot ordination with Gower

print(plist[["gower"]])

w

# Plot ordination with w

print(plist[["w"]])

Gap Statistic

How Many Clusters Are There?

The clusGap function from the cluster package calculates a goodness of clustering measure, the “gap” statistic. For each number of clusters k, it compares W(k) with \(E[W(k)]\) where the latter is defined by bootstrapping.

Let’s give this a try:

First: Ordination

Well, really, first, let’s set things up, then ordinate:

# Set up

library(cluster)
packageVersion("cluster")
## [1] '2.0.6'
theme_set(theme_bw())
data(enterotype)

# Now ordinate
exord <- ordinate(enterotype, method = "MDS", distance = "jsd")

Compute Gap Statistic

# Compute gap statistic

paml = function(x, k){list(cluster=pam(x,k, cluster.only = TRUE))}
x = phyloseq:::scores.pcoa(exord, distplay="sites")
gskmn = clusGap(x[,1:2], FUN=paml, K.max = 6, B=50)
gskmn
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = x[, 1:2], FUNcluster = paml, K.max = 6, B = 50)
## B=50 simulated reference sets, k = 1..6; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstSEmax', SE.factor=1): 4
##          logW   E.logW       gap     SE.sim
## [1,] 2.995599 3.111761 0.1161627 0.01736487
## [2,] 2.209852 2.769404 0.5595520 0.02029360
## [3,] 1.922188 2.584278 0.6620893 0.02645567
## [4,] 1.685798 2.411261 0.7254637 0.02517702
## [5,] 1.601025 2.277704 0.6766790 0.02373395
## [6,] 1.480640 2.175675 0.6950353 0.02357250

Plot Results

Define a plot method for results

# Define plot method

plot_glusgap <- function(clusgap, title="Gap Statistic calculation results"){
  require("ggplot2")
  gstab <- data.frame(clusgap$Tab, k=1:nrow(clusgap$Tab))
  p <- ggplot(gstab, aes(k, gap)) + geom_line() + geom_point(size=5)
  p <- p + geom_errorbar(aes(ymax = gap+SE.sim, ymin = gap - SE.sim))
  p <- p + ggtitle(title)
  return(p)
}

# Define a wrapper function

gap_statistic_ordination <- function(ord, FUNcluster, type="sites", K.max=6, axes=c(1:2), B=500, verbose=interactive(), ...){
    require("cluster")
    #   If "paml" was chosen, use this internally defined call to pam
    if(FUNcluster == "paml"){
        FUNcluster = function(x,k) list(cluster = pam(x, k, cluster.only=TRUE))     
    }
    # Use the scores function to get the ordination coordinates
    x = phyloseq:::scores.pcoa(ord, display=type)
    #   If axes not explicitly defined (NULL), then use all of them
    if(is.null(axes)){axes = 1:ncol(x)}
    #   Finally, perform, and return, the gap statistic calculation using cluster::clusGap  
    clusGap(x[, axes], FUN=FUNcluster, K.max=K.max, B=B, verbose=verbose, ...)
}

Now try out this function

# Make the plot

gs <- gap_statistic_ordination(exord, "paml", B=50, verbose = FALSE)
print(gs, method="Tibs2001SEmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = x[, axes], FUNcluster = FUNcluster, K.max = K.max,     B = B, verbose = verbose)
## B=50 simulated reference sets, k = 1..6; spaceH0="scaledPCA"
##  --> Number of clusters (method 'Tibs2001SEmax', SE.factor=1): 4
##          logW   E.logW       gap     SE.sim
## [1,] 2.995599 3.115285 0.1196868 0.02048797
## [2,] 2.209852 2.768789 0.5589367 0.02132839
## [3,] 1.922188 2.583195 0.6610064 0.02858008
## [4,] 1.685798 2.418597 0.7327996 0.02788483
## [5,] 1.601025 2.282590 0.6815649 0.01999317
## [6,] 1.480640 2.177968 0.6973283 0.02323917
plot_clusgap(gs)

For comparison, let’s see what happens if we only use plotting as available in base R:

# Base R plotting

plot(gs, main = "Gap statistic for the 'Enterotypes' data")
mtext("Looks like 4 clusters is best, with 3 and 5 as close runners-up")

Ordination Plots

We want to filter low-occurrence, poorly-represented OTUs from these data, because they are essentially noise variables for purposes of this lesson. In practice, you should probably perform clearly-document well-justified pre-processing steps.

First we want to remove OTUs that do not appear more than 5 times in more than half the samples.

# Remove OTUs that don't appear more than 5 times in more than half the samples

GP <- GlobalPatterns
wh0 <- genefilter_sample(GP, filterfun_sample(function(x) x>5), A = 0.5*nsamples(GP))
GP1 <- prune_taxa(wh0, GP)

Transform to an even sampling depth.

# Transform to an even sampling depth

GP1 <- transform_sample_counts(GP1, function(x) 1e6 * x/sum(x))

Keep only the 5 most abundant phyla.

# Keep the top 5 most abundant phyla  

phylum.sum <- tapply(taxa_sums(GP1), tax_table(GP1)[, "Phylum"], sum, na.rm = TRUE)
top5phyla <- names(sort(phylum.sum, TRUE))[1:5]
GP1 <- prune_taxa((tax_table(GP1)[, "Phylum"] %in% top5phyla), GP1)
summary(GP1)
##   Length    Class     Mode 
##        1 phyloseq       S4
GP1
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 204 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 204 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 204 tips and 203 internal nodes ]

So we still have 204 taxa that are in the dataset GP1 that we will use going forward.

We will want to investigate human-associated microbiomes vs not, so we will define as follows:

# Define human-associate samples
human <- get_variable(GP1, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue")
sample_data(GP1)$human <- factor(human)

Four Main Types of Ordination Plots

There are four basic representations of an ordination.

Just OTUs

Let’s plot just the OTUs and shade by Phylum. Remember there will ntaxa(GP1) = 204 OTUs in this plot.

# Plot ordination for just OTUs

GP.ord <- ordinate(GP1, "NMDS", "bray")
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.1333468 
## Run 1 stress 0.1385324 
## Run 2 stress 0.1510167 
## Run 3 stress 0.1567703 
## Run 4 stress 0.1333468 
## ... Procrustes: rmse 8.608675e-06  max resid 2.50391e-05 
## ... Similar to previous best
## Run 5 stress 0.1465222 
## Run 6 stress 0.1558954 
## Run 7 stress 0.1333468 
## ... New best solution
## ... Procrustes: rmse 5.180405e-06  max resid 1.537846e-05 
## ... Similar to previous best
## Run 8 stress 0.1471619 
## Run 9 stress 0.1333468 
## ... Procrustes: rmse 5.376844e-06  max resid 1.460572e-05 
## ... Similar to previous best
## Run 10 stress 0.3771593 
## Run 11 stress 0.1385323 
## Run 12 stress 0.1576441 
## Run 13 stress 0.1578995 
## Run 14 stress 0.1518735 
## Run 15 stress 0.1452627 
## Run 16 stress 0.1333468 
## ... New best solution
## ... Procrustes: rmse 9.998651e-07  max resid 2.800828e-06 
## ... Similar to previous best
## Run 17 stress 0.3038969 
## Run 18 stress 0.1691099 
## Run 19 stress 0.1529609 
## Run 20 stress 0.1471619 
## *** Solution reached
p1 <- plot_ordination(GP1, GP.ord, type="taxa", color="Phylum", title = "taxa")
print(p1)

There is a lot going on here, and a lot of points are superimposed on one another. One way to deal with that is to use facetting.

# Facet-wrapping for previous plot

p1 + facet_wrap(~Phylum, 3)

Just Samples

Another way to plot is to consider the samples, and shade by “SampleType”, modifying shape according to if they are human associated.

# Plot just the Samples

p2 <- plot_ordination(GP1, GP.ord, type="samples", color="SampleType", shape="human")
p2 + geom_polygon(aes(fill=SampleType)) +
  geom_point(size=5) +
  ggtitle("samples")

Biplot Graphic of OTUs and Samples Together

The plot_ordination function can also automatically create two different graphic layouts in which both the sampels and OTUs are plotted together in one “biplot”. This will not work for methods that are intrinsically samples-only ordinations (e.g., UniFrac/PCoA).

# Biplot Graph

p3 <- plot_ordination(GP1, GP.ord, type="biplot", color="SampleType", shape="Phylum", title = "biplot")
 # The following will modify the automatic shape scale
GP1.shape.names <- get_taxa_unique(GP1, "Phylum")
GP1.shapes <- 15:(15 + length(GP1.shape.names) - 1)
names(GP1.shapes) <- GP1.shape.names
GP1.shapes["samples"] <- 16

p3 + scale_shape_manual(values=GP1.shapes)

Split Graphic

We have the same problem as before, with a lot of overlay that might obscure meanint. Let’s try to separate into side-by-side panels.

# Split graphic into panels

p4 <- plot_ordination(GP1, GP.ord, type="split", color="Phylum", shape = "human", label = "SampleType", title = "split")
## Warning: Ignoring unknown aesthetics: na.rm
p4

Ordination Methods

Now we are going o loop through different method parameter options to the plot_ordination function, store the plot results, then plot in a combined graphic.

# First use different ordination methods and plot to list

dist <- "bray"
ord_meths <- c("DCA", "CCA", "RDA", "DPCoA", "NMDS", "MDS", "PCoA")
plist <- llply(as.list(ord_meths), function(i, physeq, dist){
  ordi <- ordinate(physeq, method=i, distance = dist)
  plot_ordination(physeq, ordi, "samples", color="SampleType")
}, GP1, dist)
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.1333468 
## Run 1 stress 0.1385323 
## Run 2 stress 0.1696644 
## Run 3 stress 0.1333468 
## ... Procrustes: rmse 7.532724e-06  max resid 1.866483e-05 
## ... Similar to previous best
## Run 4 stress 0.1678653 
## Run 5 stress 0.1391747 
## Run 6 stress 0.1530178 
## Run 7 stress 0.1530178 
## Run 8 stress 0.1459312 
## Run 9 stress 0.1333468 
## ... Procrustes: rmse 7.902636e-06  max resid 2.69886e-05 
## ... Similar to previous best
## Run 10 stress 0.1521722 
## Run 11 stress 0.1333468 
## ... New best solution
## ... Procrustes: rmse 1.353412e-06  max resid 3.321661e-06 
## ... Similar to previous best
## Run 12 stress 0.1750778 
## Run 13 stress 0.1465222 
## Run 14 stress 0.1470663 
## Run 15 stress 0.1768068 
## Run 16 stress 0.1720605 
## Run 17 stress 0.1644737 
## Run 18 stress 0.1333468 
## ... Procrustes: rmse 7.376441e-06  max resid 2.13344e-05 
## ... Similar to previous best
## Run 19 stress 0.1333468 
## ... Procrustes: rmse 1.619777e-06  max resid 4.374305e-06 
## ... Similar to previous best
## Run 20 stress 0.1615348 
## *** Solution reached
names(plist) <- ord_meths

Now we will extract the data from each of the individual plots and put them back together in one big data.frame suitable for including all plots in one graphic.

# Extract data from each of the individual plots and put them all together in one big dataframe

pdataframe <- ldply(plist, function(x){
  df <- x$data[, 1:2]
  colnames(df) <- c("Axis_1", "Axis_2")
  return(cbind(df, x$data))
})

names(pdataframe)[1] <- "method"

Now with all ordination results combined in one dataframe, called pdataframe, we can make a standard faceted scatterplot with ggplot2.

# Make faceted scatterplot

p <- ggplot(pdataframe, aes(Axis_1, Axis_2, color = SampleType, shape=human, fill=SampleType))
p <- p + geom_point(size=4) +
          geom_polygon()
p <- p + facet_wrap(~method, scales="free")
p <- p + scale_fill_brewer(type="qual", palette = "Set1")
p <- p + scale_colour_brewer(type="qual", palette = "Set1")
p

If you want t larger version of an individual plot, print from the original plist from which the pdataframe was mde. For instance plot the detrended correspondence analysis (DCA), which is the second element of the list.

# Print DCA from original plist


plist[[2]]

Make it look nicer.

#Fluff it up

p <- plist[[2]] + scale_colour_brewer(type="qual", palette = "Set1")
p <- p + scale_fill_brewer(type="qual", palette = "Set1")
p <- p + geom_point(size = 5) +
         geom_polygon(aes(fill=SampleType))
p

PCoA on UniFrac Distances

You can also do PCoA on UniFrac distances

Use the ordinate function to perform weighted UniFrac then perform PCoA on the distance matrix. Then pass the data and the ordination results to plot_ordination to create the ggplot2 graphic withe default settings.

# get unifrac distances and pcoa them

ordu <- ordinate(GP1, "PCoA", "unifrac", weighted=TRUE)

# plot it

plot_ordination(GP1, ordu, color="SampleType", shape = "human")

Let’s pretty it up.

# Fluff up the graph

p <- plot_ordination(GP1, ordu, color="SampleType", shape="human")
p <- p + geom_point(size=7) 
p <- p + scale_colour_brewer(type="qual", palette = "Set1")
p + ggtitle("MDS/PCoA on weighted UniFrac distance, Global Patterns dataset")

Alpha Diversity Graphics

Although the command is to the plot_richness function, the function in fact calculates to more than to richness, in fact to all descriptions of alpha diversity.

First some setup

# Some setup for graphics

theme_set(theme_bw())
pal <- "Set1"
scale_colour_discrete <-  function(palname=pal, ...){
  scale_colour_brewer(palette=palname, ...)
}
scale_fill_discrete <-  function(palname=pal, ...){
  scale_fill_brewer(palette=palname, ...)
}

Prepare the Data

It is probably not a bad idea to prune OTUs that are not present in any of the samples (somehow that happens) but don’t trim more than that! Many richness estimates are modeled on singletons and doubletons, and you need to leave them in the dataset to get meaningful estimates.